First off, we need to begin with the usual disclaimers and warnings (not that they are necessarily necessary), but just in case: your mileage may vary, no claims made, use at your own risk, don’t try this at home, performed on a closed track with a professional driver, and so on. And please note that these ideas are strictly conceptual, at least as far as I’m aware, and have not been tested or measured.
Reverse spider crawl … at least that’s the name I gave this when the thought first developed. The seeds for this idea came from a couple different areas:
- Sites that struggled to get deeper pages crawled and indexed, most likely because they had very deep structures, tons of pages (in the hundreds of thousands to millions), or hurdles in the crawl path (such as heavy parameter-based URLs, or destination pages that could only be reached through what might be perceived as low-value or search pages, as is often seen in ecommerce sites).
- Or maybe it was a (again, probably large) site that needed to get indexed as fully and quickly as possible: an existing site adding new content, a site that made dramatic URL changes (hopefully with 301s put in place), or maybe a brand-new site.
So, how do sites normally get crawled and indexed? Typically via the following:
- Discovery: Search engine spiders follow links from elsewhere into the site
- Submission: A site owner goes to the “submit your site” page at the engines and submits their site (don’t laugh, you know this is still done all the time)
- XML sitemaps: A site owner creates an XML sitemap for their site, and may ping or validate it with the search engines, or maybe just places it at the root of their site as sitemap.xml and/or uses auto-discovery within robots.txt (a bare-bones sketch of this setup follows the list)
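To make that third option a little more concrete, here’s a minimal sketch of what it involves: a bare-bones sitemap.xml written to the root of the site, plus the one-line directive in robots.txt that lets the engines auto-discover it. The domain, URLs, and file paths here are placeholders, not a real site.

```python
# Minimal sketch of the sitemap.xml + robots.txt auto-discovery setup.
# The domain and file paths are placeholders, not a real site.

urls = [
    "http://www.example.com/",
    "http://www.example.com/widgets/",
    "http://www.example.com/widgets/blue-widget-1234/",
]

entries = "\n".join("  <url><loc>{}</loc></url>".format(u) for u in urls)

sitemap_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    + entries + "\n</urlset>\n"
)

# Placed at the root of the site as sitemap.xml ...
with open("sitemap.xml", "w") as f:
    f.write(sitemap_xml)

# ... and referenced in robots.txt so the engines can auto-discover it.
with open("robots.txt", "a") as f:
    f.write("Sitemap: http://www.example.com/sitemap.xml\n")
```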
Each of those methods has its pros and cons, but they do at least address the issue of getting the spiders into the site, and in the case of XML sitemaps, may help the spiders over some of the crawling hurdles. While it may not be entirely accurate, I tend to think of this as a top-down approach. Like so…
Either directly or indirectly, the homepage gets discovered, and the spiders work their way down through the site: top navigation down through the secondary or sub-navigation, crawling their way through top categories, through the layers of sub-categories, on down to the deeper layers of product or information pages, which ironically may be the pages you want indexed the most to capture both head and long-tail searches. These, after all, may also be the money or conversion pages.
The wild card, however, is how frequently the spiders come to the site, how many pages they crawl at a time, and how often new pages are found and crawled versus existing pages being recrawled… essentially, crawl equity. This is another reason why eliminating duplicate content is so important: you don’t want to waste that spider love on content that has already been crawled and indexed.
When you multiply this out, for some of these mammoth sites, that’s an awful lot of URLs to crawl and spider love to go around… even without any of the potential crawling hurdles. Keep in mind that, even with XML sitemaps, at a 50,000-URL limit per sitemap, that’s a minimum of 20 separate sitemap files plus a sitemap index file for a site with 1 million pages.
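To put a rough shape on that, here’s a hedged sketch of the chunking itself: splitting a big flat list of URLs into 50,000-URL sitemap files and building a sitemap index that points at each one. The file names, domain, and URL list are made up purely for illustration.

```python
# Sketch: chunking a large URL list into 50,000-URL sitemap files
# plus a sitemap index. File names, domain, and URLs are hypothetical.

SITEMAP_URL_LIMIT = 50000
XML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n'
NS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'

def write_sitemaps(urls, base="http://www.example.com"):
    sitemap_urls = []
    for n in range(0, len(urls), SITEMAP_URL_LIMIT):
        name = "sitemap-{}.xml".format(n // SITEMAP_URL_LIMIT + 1)
        batch = urls[n:n + SITEMAP_URL_LIMIT]
        entries = "\n".join("  <url><loc>{}</loc></url>".format(u) for u in batch)
        with open(name, "w") as f:
            f.write(XML_HEADER + "<urlset " + NS + ">\n" + entries + "\n</urlset>\n")
        sitemap_urls.append("{}/{}".format(base, name))

    # The sitemap index file simply lists each individual sitemap file.
    index = "\n".join("  <sitemap><loc>{}</loc></sitemap>".format(s) for s in sitemap_urls)
    with open("sitemap-index.xml", "w") as f:
        f.write(XML_HEADER + "<sitemapindex " + NS + ">\n" + index + "\n</sitemapindex>\n")

# A site with 1 million URLs ends up with 20 sitemap files plus the index.
```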
Again, let’s remember that this idea isn’t applicable to every site or situation; it’s probably more in line with the scenarios laid out above, or similar ones.
The foundation of this “reverse spider crawling” is to use the XML sitemap a little bit differently. The typical XML sitemap strategy would be to feed the spiders every URL, in the hopes of getting every page indexed — this may not actually be the best strategy to begin with, but that’s another discussion. Even then, that strategy often places more priority or importance on the top level pages.
Instead, we focus the sitemap(s) on the lowest-level pages, which are probably the individual product or information pages. The idea here is to give extra focus to the pages that may be the hardest or deepest ones for the spiders to reach.
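As a rough illustration of what focusing the sitemap(s) on the lowest-level pages might look like, here’s a sketch that keeps only the deepest URLs from a full list of a site’s pages. Judging depth by the number of path segments is just one assumption for the example; a real site might instead pull product URLs straight from its catalog.

```python
from urllib.parse import urlparse

# Sketch of the "reverse" sitemap: keep only the deepest, product-level
# URLs and leave the homepage and top-level pages out of the sitemap.
# The depth threshold and the example URLs are assumptions for illustration.

def url_depth(url):
    """Number of path segments, used here as a stand-in for page depth."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])

all_urls = [
    "http://www.example.com/",
    "http://www.example.com/widgets/",
    "http://www.example.com/widgets/blue/",
    "http://www.example.com/widgets/blue/left-handed-widget-1234/",
]

MIN_DEPTH = 3  # only URLs at least this deep go into the sitemap

deep_urls = [u for u in all_urls if url_depth(u) >= MIN_DEPTH]
# Only the left-handed-widget page makes the cut: the spiders get dropped
# directly onto the deep pages and crawl their way back "up" from there.
```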
Where are they going to go from there? Well, they’re going to do what spiders do best: crawl. Think about these deep pages for a second. Hopefully they are content-rich and keyword-rich. It is probably safe to say that they have some navigational elements on them… if not the full top-level navigation, at least some form of category or silo navigation. And if we are really fortunate, there is some form of breadcrumb navigation in place and we can literally feed the spiders breadcrumbs.
At this point, we’ve probably opened up more of the site to the spiders, and in many ways, the concept is probably less about reverse crawling and more about crawling from both ends because the top-level navigation is being seen as well. More than likely, we’ve introduced the spiders to more unique URLs at once as well.
Again, under the hypothetical of spiders naturally finding the site’s homepage and crawling “down” through the site, how many unique URLs do they encounter going from the homepage to the top-level pages, compared to starting out on a batch of product-level pages? My guess is that within the first level or two of crawled URLs, the numbers are dramatically different, and in favor of starting from the deeper pages.
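One way to reason about that comparison is with a quick, purely hypothetical breadth-first count of how many unique URLs are reachable within a couple of hops, starting either from the homepage or from a handful of product pages. The little link graph below is a toy stand-in, not real crawl data.

```python
from collections import deque

# Toy link graph standing in for a site's internal linking; each page
# links to a handful of others. Purely illustrative, not real crawl data.
links = {
    "/": ["/cat-a/", "/cat-b/"],
    "/cat-a/": ["/cat-a/sub-1/", "/cat-a/sub-2/"],
    "/cat-b/": ["/cat-b/sub-1/"],
    "/cat-a/sub-1/": ["/product-1/", "/product-2/"],
    "/cat-a/sub-2/": ["/product-3/"],
    "/cat-b/sub-1/": ["/product-4/"],
    "/product-1/": ["/", "/cat-a/", "/cat-a/sub-1/", "/product-2/"],
    "/product-2/": ["/", "/cat-a/", "/cat-a/sub-1/", "/product-1/"],
    "/product-3/": ["/", "/cat-a/", "/cat-a/sub-2/"],
    "/product-4/": ["/", "/cat-b/", "/cat-b/sub-1/"],
}

def unique_urls_within(start_pages, hops):
    """Count unique URLs seen within `hops` link-follows of the start set."""
    seen = set(start_pages)
    frontier = deque((page, 0) for page in start_pages)
    while frontier:
        page, depth = frontier.popleft()
        if depth == hops:
            continue
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return len(seen)

# Starting at the homepage vs. starting on a few deep product pages:
print(unique_urls_within(["/"], 2))                                          # 6 pages
print(unique_urls_within(["/product-1/", "/product-3/", "/product-4/"], 2))  # all 10 pages
```

In that toy example, starting from a few deep pages exposes the whole site within two hops, while starting from the homepage only gets partway down.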
The beauty of this, of course, is that I tend to believe it is actually a rather low-risk approach (though, as mentioned, this is only conceptual and untested at this point). Spiders will naturally find pages whether an XML sitemap is in place or not, and, just as importantly, will find pages of a site that aren’t in the sitemap, so having a sitemap without the homepage and top-level pages doesn’t remove or exclude those pages in any way.
In the wild, of course, perhaps none of this matters. Perhaps the spiders crawl the site quickly anyway. Perhaps they focus their effort toward the root once they find links to it. Perhaps these pages are so deep that they would get little love either way.
Either way, if it were critical for these pages to get crawled and indexed and my site were having a hard time making that happen, or if time was a factor and it wasn’t happening quickly enough, I think I’d at least give this a try. Once the indexing of these pages, or of the entire site, reached the level I desired, I might go ahead and add the rest of the URLs into the XML sitemap(s), or maybe I’d experiment with having them in there versus leaving them out.
Some of the magical questions (with a rough measurement sketch after the list) are:
- Do more pages get indexed overall?
- Does it help deeper pages get indexed better or quicker, or both?
- Does it increase the overall rate of indexing and get more pages indexed in a shorter period of time?
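If I were actually running this experiment, one rough way to chip away at those questions would be to watch the server access logs and count spider hits against the deep URLs that went into the “reverse” sitemap, then track how that count moves over time. The log format, the user-agent check, and the file names below are all assumptions for illustration; a real test would verify the bots properly and pair this with whatever indexing reports the engines provide.

```python
import re

# Sketch: count search engine spider hits on the deep URLs from the
# "reverse" sitemap by scanning the access log. The log format (common
# log format), bot check, and file names are illustrative assumptions.

BOT_PATTERN = re.compile(r"googlebot|bingbot", re.IGNORECASE)
REQUEST_PATH_FIELD = 6  # in common log format, the request path is the 7th field

def load_deep_paths(path_file):
    """Deep URL paths from the reverse sitemap, stored one per line."""
    with open(path_file) as f:
        return {line.strip() for line in f if line.strip()}

def count_spider_hits(access_log, deep_paths):
    hits = {}
    with open(access_log) as f:
        for line in f:
            if not BOT_PATTERN.search(line):
                continue
            fields = line.split()
            if len(fields) <= REQUEST_PATH_FIELD:
                continue
            path = fields[REQUEST_PATH_FIELD]
            if path in deep_paths:
                hits[path] = hits.get(path, 0) + 1
    return hits

# Example usage (file names are placeholders):
# deep = load_deep_paths("reverse-sitemap-paths.txt")
# print(count_spider_hits("access.log", deep))
```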
Most of all, this is a concept open for discussion. What ideas or variations come to mind? Are there other reasons or instances that you might want to try this? What concerns do you have, if any? And perhaps most importantly, what experiences do you have that you want to share?